-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update clustering.py #37
Conversation
Changes in clustering.py file to shift dependency from hlu09's tour_model_extended to main branch trip_model. Still need to change type of data being passed to fit function for this to work.
All dependencies of this notebook from custom branch are removed. There currently seems no errors while generating maps in clustering_examples notebook.
With these changes, no change in e-mission-server should be required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Much better....
Have you run the code?
Please indicate "testing done".
Do you get the same graphs as the paper?
Note that the check is not "do I get a graph", it is "do I get the same graph"
Yes
Yet to confirm
Ongoing. The way I am planning to test this is I'll match and compare labels generated by both custom branch and master branch. This will verify that master branch and custom branch are functioning similarly. Is there any other way I can test this ? |
They differ. Let me check why this is happening. |
passing way of clustering to the e-mission-server. It was 'origin-destination' by default. Now can take one of three values, 'origin','destination' or 'origin-destination'.
Tested. This is running with no errors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this particular change seems fine if it works.
- Turned out to be pretty simple after all?!
- I would like to see more information in the PR issue that it works (screenshots, information about the model indicating that it works)
- is this the only notebook that is affected by the change? I know that we have a notebook which generates the performance (accuracy/F-score) of various algorithms. I would expect that it would also need to be changed...
Almost All the other notebooks have dependencies on this module |
previous suggestions to improve readability.
This reverts commit 3e19b32.
Suggestions from previous comments to improve readability.
…VM_decision_boundaries` compatible with changes in `clustering.py` and `mapping.py` files. Also porting these 3 notebooks to trip_model `cluster_performance.ipynb`, `generate_figs_for_poster` and `SVM_decision_boundaries` now have no dependence on the custom branch. Results of plots are attached to show no difference in theie previous and current outputs.
`Classification_performance` and `regenerate_classification_performance_results.py` are not tested yet as they would take too long to run. The itertools removal in these two files is tested in other notebooks and it works. Other files, like models.py will be tested once any of the above two are run.
This reverts commit bb404e9.
[Partially Tested] Suggested changes implemented bb404e9 `Classification_performance` and `regenerate_classification_performance_results.py` are not tested yet as they would take too long to run. The itertools removal in these two files is tested in other notebooks and it works. Other files, like models.py will be tested once any of the above two are run.
Since this is partially tested, I'll keep the PR as draft, as soon as I have completed the final testing, I'll mark it as ready to merge. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost done, just a few minor changes
Fixed names of variables to be more self-explanatory
Not tested. Needs Testing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even smaller cleanups
1. Change in models file a.t. changes in greedy_similarity_binning in e-mission-server 2.Minor fixes
On the left are Plots after current testing, on the right are images from runs of notebook with @hlu109 custom branch : naive fixed-width clustering from the first user's data150m 50m 100m DBSCAN without SVM: home cluster with a blue cluster to the south that was merged inDBSCAN + SVM: home cluster and blue cluster to the south have been separated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@humbleOldSage I don't think you have addressed several of the prior review comments. They are fairly simple, so you might have just missed them - please check the review history carefully.
Please make sure that all comments are addressed before marking as ready for review.
TRB_label_assist/models.py
Outdated
@@ -378,13 +405,19 @@ def _distance_helper(self, tripa, tripb, loc_type): | |||
|
|||
copied from the Similarity class on the e-mission-server. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that we are no longer on a custom branch, can't the distance calculation in e-mission-server
be re-used here? Why do we need copy-pasted code? I am generally again copy-pasting to support DRY. Since this change has already had multiple revisions, I am OK with deferring this to the next PR, but I want to make sure that it is not forgotten.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are indeed using e-mission-server
.
Line 416:
dist= ecc.calDistance([pta_lon,pta_lat],[ptb_lon,ptb_lat])
ecc here is on e-mission-server
:
import emission.core.common as ecc
That's an old comment from Hannah that we should remove.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While predicting in the greedy_similarity_binning.py
on e-mission-server
, the flow goes like :
predict
-> _nearest_bin
->similar
( in e-mission-server/emission/analysis/modelling/similarity/similarity_metric.py
) ->similarity
( in e-mission-server/emission/analysis/modelling/similarity/od_similarity.py
) -> ecc.calDistance
.
Ans so currently we are using this ecc.calDistance
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that you have misunderstood the comment. The comment is not about ecc.calDistance
- it is clear that calDistance
is from
This says that _distanceHelper
is copied from e-mission-server, which is it
$ grep -r distance_helper emission/
emission//analysis/modelling/tour_model_first_only_orig/similarity.py: if not self.distance_helper(a, b):
emission//analysis/modelling/tour_model_first_only_orig/similarity.py: def distance_helper(self, a, b):
emission//analysis/modelling/tour_model/similarity.py: if not self.distance_helper(a, b):
emission//analysis/modelling/tour_model/similarity.py: def distance_helper(self, a, b):
emission//incomplete_tests/TestSimilarity.py: self.assertTrue(sim.distance_helper(b,c))
However, the implementation does seem to be a bit different, and I don't see a function with this name in trip_model. There must be an equivalent in trip_model to calculate distances between trips, which we should reuse here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other than ecc.calDistance
, there just pre-processing in _distance_helper
in the form of coordinate extraction :
#tripa is taken from the test datframe.
#tripb is taken from the stored bin list.
pta_lat = tripa[[loc_type + '_lat']]
pta_lon = tripa[[loc_type + '_lon']]
if loc_type == 'start':
ptb_lat = tripb[1]
ptb_lon = tripb[0]
elif loc_type == 'end':
ptb_lat = tripb[3]
ptb_lon = tripb[2]
We do have extract_features
function ( at e-mission-server/emission/analysis/modelling/similarity/od_similarity.py
) on e-mission-server
that extracts latitude and longitude of trips, but it works just for Entry
type data ( since data frames are not used in e-mission-server
).
There must be an equivalent in trip_model to calculate distances between trips, which we should reuse here.
For the reason above, there isn't. However, this is what we can do to use the _nearest_bin
function ( e-mission-server/emission/analysis/modelling/trip_model/greedy_similarity_binning.py
) which is closest to _distance_helper
:
- convert
tripa
( the test trip) to entry type data usingdf_row_to_entry
( ine-mission-server/emission/storage/timeseries/builtin_timeseries.py
). - Pass this entry to the
_nearest_bin
function.
Let me know if this works, and I'll test this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From #37 (comment)
Since this change has already had multiple revisions, I am OK with deferring this to the next PR, but I want to make sure that it is not forgotten.
I am fine with returning to this later. However, eventually, we should simplify the codebase to either use only dataframes, or use only entries, or, if we are going to support some level of mix and match, have the utility functions support both combinations.
Can you please file an issue for this so that we don't forget it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
filed here #39 .
Minor Fixes to improve readability.
@humbleOldSage two more comments. |
Improved readability
Squash-merging since this is 21 commits for some fairly simple changes. |
* Update clustering.py Changes in clustering.py file to shift dependency from hlu09's tour_model_extended to main branch trip_model. Still need to change type of data being passed to fit function for this to work. * moving clustering_examples.ipynb to trip_model All dependencies of this notebook from custom branch are removed. There currently seems no errors while generating maps in clustering_examples notebook. * Removing changes in builtimeseries.py With these changes, no change in e-mission-server should be required. * Changes to support TRB_Label_Assist passing way of clustering to the e-mission-server. It was 'origin-destination' by default. Now can take one of three values, 'origin','destination' or 'origin-destination'. * suggestions previous suggestions to improve readability. * Revert "suggestions" This reverts commit 3e19b32. * Improving readability Suggestions from previous comments to improve readability. * making `cluster_performance.ipynb`, `generate_figs_for_poster` and `SVM_decision_boundaries` compatible with changes in `clustering.py` and `mapping.py` files. Also porting these 3 notebooks to trip_model `cluster_performance.ipynb`, `generate_figs_for_poster` and `SVM_decision_boundaries` now have no dependence on the custom branch. Results of plots are attached to show no difference in theie previous and current outputs. * Unified Interface for fit function Unified Interface for fit function across all models. Passing 'Entry' Type data from the notebooks till the Binning functions. Default set to 'none'. * Fixing `models.py` to support `regenerate_classification_performance_results.py` Prior to this update, `NaiveBinningClassifier` in 'models.py' had dependencies on both of tour model and trip model. Now, this classifier is completely dependent on trip model. All the other notebooks (except `classification_performance.ipynb`) were tested as well and they are working as usual. Other minor fixes to support previous changes. * [PARTIALLY TESTED] Single database read and Code Cleanuo 1. removed mentions of `tour_model` or `tour_model_first_only` . 2. removed two reads from database. 3. Removed notebook outputs ( this could be the reason a few diffs are too big to view) * Delete TRB_label_assist/first_trial_results/cv results DBSCAN+SVM (destination).csv not required. * Reverting Notebook Reverting notebooks to initial state, since running on the browser messed up the cell index numbers. This was causing unnecessary git diffs even when no changes were made. running on VS code should resolve this. WIll do the subsequent changes on VS code and commit again. * [Partially Tested]Handled Whitespaces Whitespaces corrected. * [Partially Tested] Suggested changes implemented `Classification_performance` and `regenerate_classification_performance_results.py` are not tested yet as they would take too long to run. The itertools removal in these two files is tested in other notebooks and it works. Other files, like models.py will be tested once any of the above two are run. * Revert "[Partially Tested] Suggested changes implemented" This reverts commit bb404e9. * [Partially Tested] Suggested changes implemented [Partially Tested] Suggested changes implemented bb404e9 `Classification_performance` and `regenerate_classification_performance_results.py` are not tested yet as they would take too long to run. The itertools removal in these two files is tested in other notebooks and it works. Other files, like models.py will be tested once any of the above two are run. * Minor variable fixes Fixed names of variables to be more self-explanatory * [TESTED] All the notebooks and files are tested 1. Change in models file a.t. changes in greedy_similarity_binning in e-mission-server 2.Minor fixes * Minor Fixes Minor Fixes to improve readability. * Minor Fixes in models.py Improved readability
Changes in
clustering.py
file to shift dependency from hlu09'stour_model_extended
to main branch'strip_model
. Still need to change type of data being passed to fit function for this to work. Marked with a TODO. Explained in detail at #35 (comment)